Skip to content

Conversation

@frasertweedale
Copy link
Contributor

cgroups v2 supports secure delegation of cgroups. Accordingly,
control over a cgroup (that is, creation of new child cgroups and
movement of processes and threads among the cgroup subtree exposed
to a container) can be safely delegated to a container. Adjusting
the ownership enables real-world use cases like systemd-based
containers fully isolated in user namespaces.

To encourage adoption of this feature, and secure implementation,
define the semantics of cgroup ownership. Changing/setting the
cgroup ownership is only allowed on cgroups v2, and the specific
files whose ownership can be change are mentioned.

In terms of current practice, this is already the behaviour of crun
(which also chown's the memory.oom.group file), and there is a pull
request for runc: opencontainers/runc#3057
(the behaviour is enabled by an annotation).

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable but using MUSTs is far too strong IMHO.

@frasertweedale
Copy link
Contributor Author

@cyphar thanks for your comments. I'll push an update later today, or tomorrow.

@frasertweedale
Copy link
Contributor Author

PR updated to fix a grammar issue and relax MUST to SHOULD per @cyphar's comments.

@kolyshkin
Copy link
Contributor

One question: should we also require cgroupns? It is assumed with cgroup v2, but what it it is not set in config?

I have just checked that by removing cgroup ns from config.json, and the whole host /sys/fs/cgroup is mounted (for both crun and runc). Meaning we definitely need to add cgroupns to the list of conditions.

@frasertweedale
Copy link
Contributor Author

frasertweedale commented Sep 23, 2021

One question: should we also require cgroupns? It is assumed with cgroup v2, but what it it is not set in config?

I have just checked that by removing cgroup ns from config.json, and the whole host /sys/fs/cgroup is mounted (for both crun and runc). Meaning we definitely need to add cgroupns to the list of conditions.

@kolyshkin good point, I agree. Should we further restrict it to a cgroupns without path specified? That is, where the runtime must create a new cgroupns? Otherwise path could be a reference to an arbitrary, pre-existing cgroupns (possibly the "root" host cgroupns).

edit I pushed an update that adds this requirement, with the empty path as I proposed.

@frasertweedale
Copy link
Contributor Author

Are there any further comments about this PR. Any further concerns? What else needs to be done for this PR to be ready to merge?

Copy link
Member

@tianon tianon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems sane to me

@cyphar @kolyshkin willing to take another look? 😅🙏

@frasertweedale
Copy link
Contributor Author

weekly nudge...

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (with the two commits squashed).

@frasertweedale
Copy link
Contributor Author

@kolyshkin thank you. I squashed the commits.

@frasertweedale
Copy link
Contributor Author

Seeking more reviews, and hopefully approvals, for this enhancement. Ping @crosbymichael, @cyphar, @dqminh, @giuseppe, @hqhq, @mrunalp, @vbatts, @vishh.

config-linux.md Outdated

- `cgroup.procs`
- `cgroup.subtree_control`
- `cgroup.threads`
Copy link

@ajwock ajwock Oct 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to mention that there is a file named /sys/kernel/cgroup/delegate (since linux 4.15) which contains a list of delegable files for cgroups. This allows further files to be delegated in the future. My fedora 33 machine has the 3 files mentioned above as well as a new one, memory.oom.group in its /sys/kernel/cgroup/delegate file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. Do folks think it makes sense to update runtime-spec to instruct runtimes to read the files to be delegated from /sys/kernel/cgroup/delegate instead (possibly with the current list as fallback)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, it should be worded like "The runtime SHOULD use the list of files from /sys/kernel/cgroup/delegate. If that file can not be used, the runtime MAY fall back to the following list" and these three files.

@ajwock
Copy link

ajwock commented Oct 20, 2021

Oh, also I'm curious- is there any reason it would be insecure for v2 cgroups to be delegated to a user other than root while running in the host namespace? Requiring a user namespace before you can manage the cgroup hierarchy adds baggage that would be less than ideal for some use cases, I'm sure.

@frasertweedale
Copy link
Contributor Author

frasertweedale commented Oct 21, 2021

Oh, also I'm curious- is there any reason it would be insecure for v2 cgroups to be delegated to a user other than root while running in the host namespace? Requiring a user namespace before you can manage the cgroup hierarchy adds baggage that would be less than ideal for some use cases, I'm sure.

We could chown to process.user.uid in the namespace, instead of uid 0 unconditionally. That is, the user that will actually execute the initial process in the container. Ordinarily that would be uid 0, but it's possible to specify configs that do something different, and even to specify a config where uid 0 is unmapped in the namespace.

Specifying that behaviour would be a relaxation of the cgroup ownership semantics as currently proposed, thus backwards compatible. Is there a use case that compels us to accommodate this scenario right now? Or shall we keep it simple for now and only chown to the namespace's uid 0 (whether or not that is the uid that will execute the process, and failing if uid 0 is unmapped)?

@ajwock
Copy link

ajwock commented Oct 21, 2021

Personally, I have a use case for managing cgroups within a container but I would prefer not to add the overhead of a user namespace if it isn't necessary.

@giuseppe and others do you see any problem with allowing this?

@frasertweedale
Copy link
Contributor Author

Personally, I have a use case for managing cgroups within a container but I would prefer not to add the overhead of a user namespace if it isn't necessary.

@giuseppe and others do you see any problem with allowing this?

The proposed behaviour doesn't require a (new) user namespace - only that the cgroup should be owned by uid 0 in the container's user namespace (which may be the host user namespace). So to admit your use case, we just need to change it to say that the cgroup should be owned by the host uid that maps to the value of process.user.uid in the container user namespace. Does anyone object to this change?

cgroups v2 supports secure delegation of cgroups.  Accordingly,
control over a cgroup (that is, creation of new child cgroups and
movement of processes and threads among the cgroup subtree exposed
to a container) can be safely delegated to a container.  Adjusting
the ownership enables real-world use cases like systemd-based
containers fully isolated in user namespaces.

To encourage adoption of this feature, and secure implementation,
define the semantics of cgroup ownership.  Changing/setting the
cgroup ownership should only be performed when:

- using cgroups v2, and
- container will have a new cgroup namespace, and
- cgroupfs will be mounted read/write.

The specific files whose ownership should be changed are listed.

In terms of current practice, this is already the behaviour of crun
(which also chown's the memory.oom.group file), and there is a pull
request for runc: opencontainers/runc#3057.

Signed-off-by: Fraser Tweedale <[email protected]>
@frasertweedale
Copy link
Contributor Author

Weekly bump. Could someone please merge it, or explain what prevents it?

@frasertweedale
Copy link
Contributor Author

Weekly bump (Nov 22). Could someone please merge it, or explain what else needs to be done or is preventing this PR from being merged?

@vbatts
Copy link
Member

vbatts commented Nov 23, 2021

@frasertweedale thank you for the weekly bumps 😬
Looks like the "MUST" that @cyphar commented on was never marked resolved. Perhaps that's why the other maintainers didn't merge? 🤷

LGTM

@vbatts vbatts merged commit c2389c3 into opencontainers:main Nov 23, 2021
@frasertweedale frasertweedale deleted the cgroup-ownership branch November 23, 2021 23:54
@frasertweedale
Copy link
Contributor Author

@vbatts thank you so much! Thanks to all for their feedback and insights.

frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 24, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 24, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 24, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 24, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 24, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 24, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 24, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 29, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 29, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 29, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 29, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 29, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Nov 29, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
frasertweedale added a commit to frasertweedale/runc that referenced this pull request Dec 7, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
AlexeyPerevalov pushed a commit to AlexeyPerevalov/runc that referenced this pull request Dec 14, 2021
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
@AkihiroSuda AkihiroSuda mentioned this pull request Jan 24, 2023
@AkihiroSuda AkihiroSuda added this to the v1.1.0 milestone Feb 1, 2023
@AkihiroSuda AkihiroSuda mentioned this pull request Jun 26, 2023
12 tasks
dims pushed a commit to dims/libcontainer that referenced this pull request Oct 19, 2024
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
dims pushed a commit to dims/libcontainer that referenced this pull request Oct 19, 2024
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
dims pushed a commit to dims/libcontainer that referenced this pull request Oct 19, 2024
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
kolyshkin pushed a commit to kolyshkin/containerd-cgroups that referenced this pull request Nov 6, 2024
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  opencontainers/runtime-spec#1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants